{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>RegEx in Python</center>\n",
    "\n",
    "![](images/memes/meme31.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Zero-width assertions\n",
    "\n",
    "- Characters which indicate positions rather than actual content are called **zero-width assertions**.\n",
    "\n",
    "\n",
    "- For instance, the caret symbol (`^`) is a representation of the beginning of a line or the dollar sign (`$`) for the end of a line. \n",
    "\n",
    "\n",
    "- They effectively do assertion without consuming characters; they just return a positive or negative result of the match.\n",
    "\n",
    "\n",
    "- A more powerful kind of **zero-width assertion** is **look around**, a mechanism with which it is possible to match a certain previous (**look behind**) or ulterior (**look ahead**) value to the current position.\n",
    "\n",
    "\n",
    "# Look around\n",
    "\n",
    "\n",
    "**Look around** is a simple mechanism which during the matching process, at the current position, looks forward (or behind, depends on type of lookaround used) to see if **some** pattern matches before continuing with the actual match.\n",
    "\n",
    "The most important thing to understand here is that **look around** mechanism consists of 2 parts:\n",
    "- **actual expression**: an expression whose match constitutes the final **result**.\n",
    "- **non-consuming expression**: an expression whose match is evaluated before the actual expression, just to see if it can succeed. It is **not actually consumed** by the regex engine.\n",
    "    - If the non-consuming match **succeeds**, the regex engine forgets about this non-consuming expression and starts evaluating the next character from the current position of the actual expression. \n",
    "    - If the non-consuming match **does not succeed**, we simply move to next character of the given text and repeat the whole match process again.\n",
    "\n",
    "There are 2 main categories of **look around**  which, in turn, have 2 sub-categories each.\n",
    "\n",
    "![](images/lookaround.png)\n",
    "\n",
    "Let's explore each one of them one by one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Look ahead\n",
    "\n",
    "**Look ahead** mechanism checks the match for a non-consuming expression **ahead** of a given pattern.\n",
    "\n",
    "\n",
    "## Positive look ahead\n",
    "\n",
    "- **Positive look ahead** will succeed if the passed non-consuming expression **does match** against the forthcoming input.\n",
    "\n",
    "- The syntax is `A(?=B)` where `A` is the **actual expression** and `B` is the **non-consuming expression**. \n",
    "\n",
    "\n",
    "Let's check out an example to understand the concept. Let's assume that we want to find a match for `love` in the given text only if it is followed by `regex`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from utils import highlight_regex_matches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"i love python, i love regex\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile('love regex')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "match = pattern.search(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(17, 27)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "match.span()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['love regex']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "i love python, i \u001b[43m\u001b[1mlove regex\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, a total of 10 (index 17 to 27) characters, i.e. `love regex` are consumed to search for the given pattern in the text.\n",
    "\n",
    "Now consider the regex pattern `love(?=\\sregex)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"love(?=\\sregex)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "match = pattern.search(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(17, 21)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "match.span()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "i love python, i \u001b[43m\u001b[1mlove\u001b[0m regex\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, using **positive look ahead** mechanism, we consumed only 4 (index 17 to 21) characters are consumed for the match.\n",
    "\n",
    "Let us check out another example to find all words in given text which are followed by `.` or `,`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"My favorite colors are red, green, and blue.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"\\w+(?=,|\\.)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['red', 'green', 'blue']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "My favorite colors are \u001b[43m\u001b[1mred\u001b[0m, \u001b[43m\u001b[1mgreen\u001b[0m, and \u001b[43m\u001b[1mblue\u001b[0m.\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Negative look ahead\n",
    "\n",
    "- **Negative look ahead** will succeed if the passed non-consuming expression **does not match** against the forthcoming input.\n",
    "\n",
    "- The syntax is `A(?!B)` where `A` is the **actual expression** and `B` is the **non-consuming expression**. \n",
    "\n",
    "\n",
    "Let's assume that we want to find a match for `love` in the given text only if it is NOT followed by `regex`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"i love python, i love regex\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"love(?!\\sregex)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "i \u001b[43m\u001b[1mlove\u001b[0m python, i love regex\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/memes/meme32.jpg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
